Data corruption

Data corruption refers to errors in computer data that occur during writing, reading, storage, transmission, or processing, which introduce unintended changes to the original data. Computer storage and transmission systems use a number of measures to provide data integrity, or lack of errors.

In general, when data corruption occurs, the file containing that data may become inaccessible, and the system or the related application reports an error. For example, if a Microsoft Word file is corrupted, attempting to open it in MS Word produces an error message and the file may fail to open. Some programs can offer to repair the file automatically after the error, while others cannot; the outcome depends on the severity of the corruption and on the application's built-in ability to handle the error. The corruption itself can arise from a variety of causes.


Transmission

Data corruption during transmission has a variety of causes. Interruption of data transmission causes information loss. Environmental conditions can interfere with data transmission, especially when dealing with wireless transmission methods. Heavy clouds can block satellite transmissions. Wireless networks are susceptible to interference from devices such as microwave ovens.

Storage

Data loss during storage has two broad causes: hardware and software failure. Background radiation, head crashes, and aging or wear of the storage device fall into the former category, while software failure typically occurs due to bugs in the code.

Error detection and correction may occur in the hardware, in the disk subsystem or adapter, or in software that implements error checking and correction (e.g., RAID software such as mdadm for Linux).

There are two broad types of data corruption: detected corruption, which the system notices and can report or attempt to repair, and undetected ("silent") corruption, which goes unnoticed until the affected data is read or used.

Countermeasures

When data corruption behaves as a Poisson process, where each bit of data has an independently low probability of being changed, data corruption can generally be detected by the use of checksums, and can often be corrected by the use of error correcting codes.
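As a minimal sketch of checksum-based detection, the example below uses Python's standard zlib.crc32; this is only an illustration of the principle, as real storage and transmission systems typically rely on dedicated ECC/CRC hardware or stronger codes.

    import zlib

    # Sketch of checksum-based corruption detection; zlib's CRC-32 stands in
    # for whatever checksum a real system would use.
    original = b"account balance: 1024.00"
    stored_checksum = zlib.crc32(original)

    # Simulate a single flipped bit during storage or transmission.
    corrupted = bytearray(original)
    corrupted[5] ^= 0x01

    if zlib.crc32(bytes(corrupted)) != stored_checksum:
        print("corruption detected: checksum mismatch")
    else:
        print("data verified")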

If an uncorrectable data corruption is detected, procedures such as automatic retransmission or restoration from backups can be applied. Certain RAID levels store and evaluate parity bits for data across a set of hard disks and can reconstruct corrupted data upon the failure of one or more disks, depending on the RAID level implemented.
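The single-parity reconstruction used by such arrays can be illustrated with a short sketch; this is a simplified model of XOR parity, not an actual RAID implementation.

    # Simplified XOR-parity model: one parity block protects against the
    # loss of any single data block (the idea behind RAID 4/5).
    def xor_blocks(blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    data = [b"AAAA", b"BBBB", b"CCCC"]   # three equal-sized "disks"
    parity = xor_blocks(data)

    # Simulate losing the second disk; XOR of the survivors and the parity
    # block rebuilds its contents.
    rebuilt = xor_blocks([data[0], data[2], parity])
    assert rebuilt == data[1]
    print("reconstructed block:", rebuilt)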

Today, many errors are detected and corrected by the disk drive itself using the ECC/CRC codes[1] stored on disk for each sector. If the disk drive detects multiple read errors on a sector, it may make a copy of the failing sector on another part of the disk, remapping the failed sector to a spare sector without the involvement of the operating system (though this may be delayed until the next write to the sector).

This "silent correction" can lead to other problems if disk storage is not managed well, as the disk drive will continue to remap sectors until it runs out of spares, at which time the temporary correctable errors can turn into permanent ones as the disk drive deteriorates. S.M.A.R.T. provides a standardized way of monitoring the health of a disk drive, and there are tools available for most operating systems to automatically check the disk drive for impending failures by watching for deteriorating SMART parameters.

"Data scrubbing" is another method to reduce the likelihood of data corruption, as disk errors are caught and recovered from, before multiple errors accumulate and overwhelm the number of parity bits. Instead of parity being checked on each read, the parity is checked during a regular scan of the disk, often done as a low priority background process. Note that the "data scrubbing" operation activates a parity check. If a user simply runs a normal program that reads data from the disk, then the parity would not be checked unless parity-check-on-read was both supported and enabled on the disk subsystem.

If appropriate mechanisms are employed to detect and remedy data corruption, data integrity can be maintained. This is particularly important in commercial applications (e.g., banking), where an undetected error could corrupt a database index or change data in a way that drastically affects an account balance, and in the use of encrypted or compressed data, where a small error can make an extensive dataset unusable.[2] Although the CERN study is often cited as showing large levels of data corruption, the disk subsystem it examined was set up with RAID 5 and a single parity bit (and hence could not recover from a single "silent" error), did not use parity-check-on-read (and hence could not detect "silent" errors through parity checking of the RAID stripe), and did not use data scrubbing. The disk storage was also subject to a bug in the disk firmware (microcode) which caused higher levels of errors than normal.[3]
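The fragility of compressed data mentioned above can be demonstrated with a small sketch: flipping a single byte in a zlib-compressed stream typically makes the whole stream undecodable.

    import zlib

    payload = b"transaction log entry\n" * 1000
    compressed = bytearray(zlib.compress(payload))

    compressed[10] ^= 0xFF  # corrupt one byte in the middle of the stream

    try:
        zlib.decompress(bytes(compressed))
    except zlib.error as exc:
        print("entire dataset unreadable after a one-byte error:", exc)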

References

  1. ^ "Read Error Severities and Error Management Logic". http://www.storagereview.com/guide/errorRead.html. Retrieved 24 July 2011. 
  2. ^ Data Integrity by Cern April 2007 Cern.ch
  3. ^ Bernd Panzer-Steindel. "Data integrity". http://indico.cern.ch/getFile.py/access?contribId=3&sessionId=0&resId=1&materialId=paper&confId=13797. There are some correlations with known problems, like the problem where disks drop out of the RAID5 system on the 3ware controllers. After some long discussions with 3Ware and our hardware vendors this was identified as a problem in the WD disk firmware.